In this module, we will show you how to:
- Reshape data from wide (fat) to long (tall)
- Reshape data from long (tall) to wide (fat)
- Merge Data/Joins
- Perform operations by a grouping variable
In this module, we will show you how to:
Data is stored differently in the tibble.
Wide: has many columns
# A tibble: 1 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4%
Long: column names become data
# A tibble: 3 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4%
Wide: multiple columns per individual, values spread across multiple columns
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: multiple rows per observation, a single column contains the values
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
Data is wide or long with respect to certain variables.
Wide: Easier for humans to read
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: Easier for R to make plots & do analysis
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
http://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv
circ = read_csv(
paste0("http://jhudatascience.org/intro_to_r/",
"data/Charm_City_Circulator_Ridership.csv"))
head(circ, 5)
# A tibble: 5 x 15 day date orangeBoardings orangeAlightings orangeAverage purpleBoardings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Monday 01/11/… 877 1027 952 NA 2 Tuesday 01/12/… 777 815 796 NA 3 Wednes… 01/13/… 1203 1220 1212. NA 4 Thursd… 01/14/… 1194 1233 1214. NA 5 Friday 01/15/… 1645 1643 1644 NA # … with 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>, # greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>, # bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>, # daily <dbl>
tidyr allows you to “tidy” your data. We will be talking about:
pivot_longer - make multiple columns into variables, (wide to long)pivot_wider - make a variable into multiple columns, (long to wide)separate - string into multiple columnsunite - multiple columns into one stringThe reshape command exists. It is a confusing function. Don’t use it.
tidyr::pivot_longer - puts column data into rows.
names_to = gives a new name to the pivoted columnsvalues_to = gives a new name to the values that used to be in those columnslong = circ %>%
pivot_longer(starts_with(c("orange","purple","green","banner")),
names_to = "var", values_to = "number")
long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
We have many columns here, so we could instead use the ! to say which columns we don’t want to pivot.
long = circ %>% pivot_longer(!c(day, date, daily),
names_to = "var", values_to = "number")
long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
long %>% count(var)
# A tibble: 12 x 2 var n <chr> <int> 1 bannerAlightings 1146 2 bannerAverage 1146 3 bannerBoardings 1146 4 greenAlightings 1146 5 greenAverage 1146 6 greenBoardings 1146 7 orangeAlightings 1146 8 orangeAverage 1146 9 orangeBoardings 1146 10 purpleAlightings 1146 11 purpleAverage 1146 12 purpleBoardings 1146
We will use str_replace from the stringr package to put _ in the names
long = long %>% mutate( var = str_replace(var, "Board", "_Board"), var = str_replace(var, "Alight", "_Alight"), var = str_replace(var, "Average", "_Average") ) long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orange_Boardings 877 2 Monday 01/11/2010 952 orange_Alightings 1027 3 Monday 01/11/2010 952 orange_Average 952 4 Monday 01/11/2010 952 purple_Boardings NA 5 Monday 01/11/2010 952 purple_Alightings NA 6 Monday 01/11/2010 952 purple_Average NA 7 Monday 01/11/2010 952 green_Boardings NA 8 Monday 01/11/2010 952 green_Alightings NA 9 Monday 01/11/2010 952 green_Average NA 10 Monday 01/11/2010 952 banner_Boardings NA # … with 13,742 more rows
Now each var is Boardings, Averages, or Alightings. We use “into =” to name the new columns and “sep =” to show where the separation should happen.
long =
long %>%
separate(var, into = c("line", "type"), sep = "_")
long
# A tibble: 13,752 x 6 day date daily line type number <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # … with 13,742 more rows
If we had the opposite problem, we could use the unite function:
reunited = long %>% unite(var, line, type, sep = "_") reunited
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orange_Boardings 877 2 Monday 01/11/2010 952 orange_Alightings 1027 3 Monday 01/11/2010 952 orange_Average 952 4 Monday 01/11/2010 952 purple_Boardings NA 5 Monday 01/11/2010 952 purple_Alightings NA 6 Monday 01/11/2010 952 purple_Average NA 7 Monday 01/11/2010 952 green_Boardings NA 8 Monday 01/11/2010 952 green_Alightings NA 9 Monday 01/11/2010 952 green_Average NA 10 Monday 01/11/2010 952 banner_Boardings NA # … with 13,742 more rows
In tidyr, the pivot_wider function spreads rows into columns. Now we have a long data set, but we want to separate the Average, Alightings and Boardings into different columns:
wide = long %>% pivot_wider(names_from = "type",
values_from = "number")
wide
# A tibble: 4,584 x 7 day date daily line Boardings Alightings Average <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 Monday 01/11/2010 952 orange 877 1027 952 2 Monday 01/11/2010 952 purple NA NA NA 3 Monday 01/11/2010 952 green NA NA NA 4 Monday 01/11/2010 952 banner NA NA NA 5 Tuesday 01/12/2010 796 orange 777 815 796 6 Tuesday 01/12/2010 796 purple NA NA NA 7 Tuesday 01/12/2010 796 green NA NA NA 8 Tuesday 01/12/2010 796 banner NA NA NA 9 Wednesday 01/13/2010 1212. orange 1203 1220 1212. 10 Wednesday 01/13/2010 1212. purple NA NA NA # … with 4,574 more rows
pivot_longer